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Abstract —  We  have  modified  the  publicly  available  electromagnetic  scattering  code  called 
DDSCAT  to  handle  bigger  targets  with  shorter  execution  times  than  the  original  code.  A  big  tar¬ 
get  is  one  whose  sphere  equivalent  radius  is  large  compared  to  the  incident  wavelength.  DDSCAT 
uses  the  discrete-dipole  approximation  and  an  accurate  result  requires  the  spacing  between  dipoles 
to  be  small  relative  to  the  wavelength.  This  requirement,  along  with  the  big  target  size,  implies 
that  the  number  of  discrete  grid  cells  defining  the  computational  grid  box  surrounding  the  target 
must  be  large.  The  memory  requirement  is  thus  significant  and  the  execution  time  long.  The 
original  version  of  the  code  cannot  handle  large  enough  targets  that  we  would  like  to  consider 
because  they  require  memory  exceeding  what  is  available  on  a  single  node  of  our  cluster  and 
execution  time  extending  to  days. 

We  have  removed  two  limitations  of  the  original  version  of  the  code.  First,  to  speed  up  the 
execution  involing  multiple  target  orientations,  the  original  code  assigned  one,  but  only  one, 

MPI  process  to  each  independent  orientation.  We  surmount  this  limitation  by  assigning  each 
orientation  a  team  of  processes.  Second,  the  original  code  allocated  all  the  memory  for  the  full 
problem  for  each  MPI  process.  We  surmount  this  limitation  by  decomposing  the  data  structures 
across  the  members  of  a  team.  With  these  changes,  the  new  version  can  handle  much  bigger 
targets.  Moreover,  it  exhibits  strong  scaling  for  fixed  problem  size.  The  execution  time  decreases 
very  nearly  1/p  as  the  processor  count  p  increases.  Execution  time  measured  in  days  with  the 
original  code  can  now  be  reduced  to  fractions  of  an  hour. 

1.  INTRODUCTION 

The  discrete-dipole  approximation  (DDA),  based  on  the  original  work  of  Purcell  and  Penny- 
packer  [7],  is  a  standard  method  for  calculating  electromagnetic  scattering  cross  sections  of  ir¬ 
regularly  shaped  targets  [6].  One  of  the  more  effective  codes  for  this  kind  of  calculation  is  the 
DDSCAT  code  from  Draine  and  Flatau  [1],  A  comparable  one  is  the  ADDA  code  from  Yurkin  and 
Hoekstra  [9] .  Kahnert  provides  a  review  of  other  solution  methods  in  a  recent  review  paper  [3] . 

DDA  reduces  the  scattering  problem  to  the  solution  of  a  large,  dense,  complex  symmetric  system 
of  linear  equations, 

Ax  =  b  (1) 

Design  of  a  code  to  solve  these  equations  and  the  resulting  performance  of  the  code  depend  on 
the  data  structures  chosen  to  represent  the  matrix  and  the  solution  technique  selected  to  solve  the 
system  of  equations.  Since  the  matrix  is  a  symmetric  Toeplitz  matrix,  only  the  elements  of  a  single 
row  need  be  stored,  and  these  elements  can  be  decomposed  and  distributed  across  processors  in 
a  straightforward  way.  To  solve  the  system  of  equations,  the  community  has,  after  many  years  of 
expermentation  and  analysis,  settled  on  using  one  or  another  variation  of  Krylovbased  solvers  [8, 
Chs.  6-7].  These  iterative  solvers  require  several  matrix-vector  multiplications  at  each  iteration. 
Since  the  matrix  is  Toeplitz,  representing  a  convolution,  this  operation  can  be  performed  using 
Fast  Fourier  Transforms.  The  matrix-vector  multiplication,  then,  becomes  a  forward  FFT  of  two 
vectors  followed  by  multiplication  in  Fourier  space  followed  by  inverse  FFT  back  to  physical  space. 

The  parallel  strategy  adopted  in  the  original  version  of  the  code  assigns  a  separate  MPI  process 
to  each  independent  target  orientation.  Each  MPI  process  allocates  all  the  memory  for  the  full 
problem,  which  may  exceed  the  memory  available  on  a  node  for  big  targets,  and  the  number  of 
processes  that  could  be  used  is  limited  by  the  number  of  orientations.  The  new  code  assigns  a 
team  of  MPI  processes  to  each  independent  orientation  and  partitions  the  data  structures  by  the 
size  of  the  team.  The  problem  now  fits  in  memory  and  more  processes  can  be  used  to  reduce  the 
execution  time. 


Public  reporting  burden  for  the  collection  of  information  is  estimated  to  average  1  hour  per  response,  including  the  time  for  reviewing  instructions,  searching  existing  data  sources,  gathering  and 
maintaining  the  data  needed,  and  completing  and  reviewing  the  collection  of  information.  Send  comments  regarding  this  burden  estimate  or  any  other  aspect  of  this  collection  of  information, 
including  suggestions  for  reducing  this  burden,  to  Washington  Headquarters  Services,  Directorate  for  Information  Operations  and  Reports,  1215  Jefferson  Davis  Highway,  Suite  1204,  Arlington 
VA  22202-4302.  Respondents  should  be  aware  that  notwithstanding  any  other  provision  of  law,  no  person  shall  be  subject  to  a  penalty  for  failing  to  comply  with  a  collection  of  information  if  it 
does  not  display  a  currently  valid  OMB  control  number. 
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2.  THE  DISCRETE-DIPOLE  APPROXIMATION 

DDA  surrounds  the  target  with  a  three-dimensional  computational  grid  box  discretized  into  cubical 
cells  of  volume  d3  with  a  dipole  located  at  the  center  of  each  of  the  cells  occupied  by  the  target  [7]. 
If  there  are  n  dipoles  in  the  target,  the  volume  occupied  by  these  dipoles  is  nd3,  resulting  in  an 
sphere-equivalent  radius  of  a,  defined  by  the  relationship,  nd 3  =  (47r/3)a3.  Scattering  cross  sections 
are  often  measured  in  units  equal  to  the  cross  sectional  area,  ira2  [2,  Ch.  16,  4,  Ch.  2], 

The  accuracy  of  the  method  requires  [1, 10, 11]  that  the  grid  spacing  between  cells,  d,  be  small 
relative  to  the  wavelength, 

2|m|A;<i  <  1  (2) 

where  m  is  the  complex  refractive  index  of  the  target  and  k  =  2n/\  is  the  wavenumber. 

A  target  is  considered  “big”  if  its  size  parameter,  ka,  is  much  bigger  than  one,  i.e.,  ka  3>  1. 
Coupled  with  requirement  (2),  it  implies  a  large  number  of  dipoles,  n  >  (2|m|  ka)3,  and  almost 
always  an  even  larger  number  of  grid  cells.  The  memory  requirement  for  a  large  target  may  thus 
exceed  what  is  available  on  a  single  computer  or  a  node  in  a  cluster.  Furthermore,  the  computational 
work  that  must  be  done  to  solve  the  system  of  equations  increases  with  the  size  of  the  grid  box  and 
the  execution  time  for  large  targets  may  easily  expand  to  days  if  only  a  single  node  can  be  used.  To 
address  these  two  problems,  we  decompose  the  computational  box  such  that  each  partition  of  the 
box  fits  into  the  memory  of  a  single  node,  and  we  assign  a  team  of  processors  to  work  in  parallel 
on  each  partition  to  reduce  the  execution  time. 

3.  THE  PARTITIONED  COMPUTATIONAL  PROBLEM 

The  parallel  implementation  strategy  for  any  application  code  involves  two  issues:  how  to  partition 
data  structures  and  how  to  assign  work.  The  original  code  takes  into  consideration  one  of  these 
issues,  the  assignment  of  work,  but  does  not  consider  the  partition  of  data  structures.  The  new 
version  addresses  both  issues. 

The  details  of  the  partition  depend  on  the  details  of  how  the  system  of  Equation  (1)  is  represented 
on  the  grid  points  of  the  discretized  computational  grid  box.  The  vector  on  the  right  side  is 
proportional  to  the  incoming  plane  wave,  exp(ik-r),  with  wavelength  A,  wavenumber  k  =  |k|, 
and  the  direction  of  the  wave  vector  relative  to  the  target  by  three  Euler  angles  (a,/?, 7).  At  each 
grid  point  of  the  computational  grid  box,  the  vector  on  the  right  side  of  the  system  of  equations, 
bijk  =  exp  [ik  ■  (n,  r-j,  77)],  has  values  proportional  to  the  value  of  the  plane  wave  at  each  grid  point 
r  =  (r'i,  Tj,  77).  In  the  same  way,  the  values  of  the  elements  of  the  matrix  A  are  evaluated  relative 
to  the  grid  points,  and  the  solution  vector  x  is  found  at  each  grid  point.  In  practice,  indices  for  the 
points  on  the  three-dimensional  grid  are  serialized  so  that  the  quantities  b  and  x  can  be  thought 
of  as  vectors  of  length  n3  and  the  matrix  A  can  be  thought  of  as  a  matrix  of  size  n3  x  n3. 

The  decomposition  strategy  is  straightforward.  The  vectors  involved  are  partitioned  along  one 
of  the  directions  of  the  computational  grid  box.  If  the  grid  points  have  been  serialized  in  lexical 
order,  the  x-direction  first,  followed  by  the  y-direction  and  then  by  the  ^-direction,  the  vectors  can 
be  partitioned  in  the  ^-direction  by  the  team  size  (i.e.,  number  of  processes  per  team)  assigned  to 
the  partition.  Correspondingly,  the  rows  of  the  matrix  are  partitioned  along  the  ^-direction.  This 
partitioning  introduces  overhead  into  the  code  due  to  the  requirement  for  moving  distributed  data 
structure  among  team  members,  but  for  big  targets,  this  overhead  is  small  as  our  results  show  in 
later  sections. 

4.  PARALLEL  CONVOLUTION 

Special  properties  of  the  matrix  A  make  the  problem  tractable.  The  matrix  is  a  function  of  the 
magnitude  of  the  plane  wave  vector,  k  =  |k|,  not  its  direction,  i.e.,  A  =  A (fc).  The  matrix  can  thus 
be  calculated  once  for  each  wavelength  and  used  for  all  orientations  of  the  incoming  wave.  It  is 
also  a  Toeplitz  matrix.  The  values  of  its  elements  depend  on  the  distance  between  grid  points  such 
that,  in  the  serialization  representation,  the  matrix  elements  have  the  property,  at]  =  ai-j.  It  can 
then  be  represented  by  one  of  its  rows,  a  vector  of  length  n3,  rather  than  a  matrix  of  size  n3  x  n3. 

The  vector  b(k)  and  the  solution  vector  x(k),  on  the  other  hand,  are  necessarily  functions  of 
both  the  magnitude  and  the  orientation  of  the  incoming  wave.  Thus,  for  each  orientation,  the 
matrix  remains  the  same  but  the  right-hand-side  vector  changes  so  the  system  of  equations, 

A  (, k )  x  (k)  =  b  (k)  , 


must  be  solved  for  the  solution  vector  for  each  oreientation. 


(3) 
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When  iterative  Krylov  solvers  are  used  to  solve  this  system,  several  matrix- vector  multiplications 
of  the  kind:  y  =  Ax,  must  be  performed  for  each  iteration.  Since  the  matrix  is  Toeplitz,  its 
elements  depend  only  on  the  difference  of  the  indices,  and  the  matrix- vector  multiplication  becomes 
a  convolution,  yt  =  JK  ai-jXj.  The  operation  count  for  this  multiplication  is  on  the  order  of  the 
square  of  the  rank  of  the  matrix  if  done  in  the  usual  way.  But  since  the  Fourier  transform  of  a 
convolution  is  the  product  of  the  Fourier  transforms  of  the  the  two  factors,  T  (y)  =  T  (A)  •  T  (x), 
the  operation  count  can  be  reduced  to  the  order  of  the  matrix  size  times  the  logarithm  of  the 
matrix  size.  This  difference  results  in  a  signficant  saving  of  computation  time.  The  result  vector  is 
retrieved  with  the  inverse  transform,  y  =  [J-  (A)  •  T  (x)]. 

The  Fourier  transform  is  a  three-dimensional  transform  over  the  grid  points  of  the  computa¬ 
tional  box.  Since  the  box  has  been  decomposed  along  the  ^-direction,  data  communication  among 
processors  is  required,  which  is  not  required  on  a  single  processor.  Each  processor  can  perform  the 
transform  independently  for  the  grid  points  it  owns  in  the  xy- plane,  but  the  data  structure  must 
be  transposed  to  do  the  transform  in  the  z-direction.  The  matrix  can  be  transformed  once  for 
each  wavelength  and  left  in  the  transposed  data  structure  in  Fourier  space.  It  does  not  need  to  be 
transformed  back  to  the  original  decomposed  data  structure.  The  vectors  at  each  iteration  of  the 
Krylov  solver  are  also  left  in  transformed  Fourier  space  and  multiplied  by  the  matrix.  Only  then 
is  the  product  of  the  two  vectors  transformed  back  to  the  original  partitioned  data  structure. 

An  important  optimization  of  the  convolution  step  results  from  treating  the  zeros  that  must  be 
padded  onto  the  vectors  before  performing  the  Fourier  transform  [5] .  Total  time  for  the  transform, 
using  a  Fast  Fourier  Transform  (FFT)  library,  is  the  sum  of  the  time  in  each  of  the  three  dimensions, 
t\  =  tx  +  ty  +  tz.  If  the  dimensions  are  about  the  same  in  each  dimension,  the  time  for  the  transform 
is  the  same  in  each  direction,  tx  =  ty  =  tz  =  to,  and  t\  =  3fo-  During  the  FFT  in  the  x-direction, 
the  time  reduces  by  half  by  not  computing  over  the  zeros  in  the  y-direction  and  again  by  half  by 
not  computing  over  the  zeros  in  the  z-direction.  Likewise,  in  the  y-direction,  the  time  reduces  by 
half  by  not  computing  over  the  zeros  in  the  ^-direction.  The  time  in  the  ^-direction  remains  the 
same  so  the  reduced  time  becomes,  t-2  =  to/4  +  to/2  +  to  =  (7/4)  to,  and  the  speedup  in  performance 
becomes  t\/t2  =  3to/(7to/4)  =  12/7  ~  1.71. 

For  large  targets,  where  the  FFT  dominates  the  execution  time,  this  modification  of  the  code 
may  lead  to  signicant  performance  speedup.  For  smaller  targets,  where  the  FFT  is  only  a  fraction  of 
the  total  execution  time,  the  benefit  will  be  lower.  If  the  transpose  represents  50%  of  the  execution 
time,  the  benefit  may  only  be  about  20%.  We  have  measured  a  12%  benefit  per  grid  point  on  a 
single  processor  for  a  test  case. 

5.  PARALLEL  MATRIX  TRANSPOSE 

For  multiple  processors,  the  analysis  of  the  time  saved  due  to  the  treatment  of  the  padding  is 
somewhat  different  from  that  of  a  single  processor  due  to  the  nontrivial  cost  for  the  matrix  transpose. 
The  amount  of  data  transferred  between  the  xy-direction  to  the  ^-direction,  during  the  matrix 
transpose  reduces  by  a  factor  of  2.  For  large  problem  sizes,  we  have  observed  that  the  time 
required  for  the  transpose  was  comparable  to  the  total  time  for  the  FFT.  If  the  transpose  is 
bandwidth  limited,  the  net  speedup  for  FFT  with  transpose  can  be  estimated  as  t\/t2  =  (3to  + 
3fo)(3fo/2  +  7 to/4:)  =  24/13  —  1.85.  We  have,  in  fact,  measured  an  overall  speedup  equal  to  1.7x 
with  the  new  treatment  of  the  padding. 

6.  SCALING  ANALYSIS 

To  compare  performance  of  the  old  version  of  the  code  with  the  new  version,  we  examine  two 
different  targets  referred  to  here  as  medium  and  large.  The  medium  target  fits  within  the  memory 
of  a  single  node  on  our  machine,  and  allows  direct  performance  comparisons  between  the  two  code 
versions.  The  large  target,  does  not  fit  in  the  memory  of  a  single  node,  and  can  only  be  run  with 
the  new  version.  The  large  target  illustrates  that  the  new  version  of  the  code  can  handle  larger 
targets  and  that,  for  these  big  targets,  it  exhibits  strong  scaling. 

An  important  advantage  of  the  new  implementation,  is  that  it  allows  the  code  to  utilize  far 
more  processors.  Whereas  the  original  code  limited  the  number  of  MPI  processes  to  the  number  of 
target  orientations,  typically  0(  102)  the  new  version  uses  a  team  of  processes  for  each  orientation. 
The  number  of  MPI  processes  in  a  team  is  only  limited  by  the  size  of  the  computational  grid. 

Parallel  strategy  for  the  new  version  of  the  code  is  good  for  the  large  case  which  does  not  fit 
in  the  memory  of  a  single  node.  With  the  new  version  of  the  code,  we  use  teams  of  size  20  and 
distribute  the  MPI  processes  across  nodes  with  each  process  requiring  one-twentieth  of  the  memory. 
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Figure  1:  Execution  time  per  grid  point,  tp  =  t(p)/nxnynz ,  as  a  function  of  processor  count.  Large  target 
with  \m\kd  =  0.3563  contained  in  a  box  of  size  (nx,  ny,nz)  =  (256, 360, 360)  marked  with  bullets  (•)  measured 
in  seconds.  The  dotted  line  represents  perfect  scaling  of  p -1.  The  intercept  at  log(ri)  =  —1.3,  marked  with 
an  asterisk  (*),  suggests  that  the  problem  on  a  single  processor  would  require  at  least  19  days  to  complete 
compared  with  about  half  an  hour  on  800  processors. 


For  this  problem,  the  FFT  dominates  the  execution  time  taking  almost  99%  of  the  total.  This  part 
of  the  code  is  fully  parallelized  and  the  execution  time,  for  the  fixed  problem  size,  exhibits  strong 
scaling  —  decreasing  inversely  with  the  processor  count  as  illustrated  in  Figure  1. 

7.  ACCURACY  OF  THE  METHOD 

The  accuracy,  as  well  as  the  performance,  of  codes  that  implement  the  discrete-dipole  approximation 
depends  on  a  number  of  empirically  determined  parameters  [10, 11].  The  distance  between  cells,  for 
example,  must  be  small  relative  to  the  wavelength  of  the  incident  wave  as  displayed  in  inequality  (2). 
The  rate  of  convergence  of  the  iterative  solver  also  depends  on  the  value  of  this  parameter,  smaller 
values  generating  larger  matrices  requiring  more  iterations  to  converge. 

Now  that  the  execution  time  has  been  reduced  to  a  few  minutes  of  time  for  very  big  targets,  we 
can  perform  parameter  studies  to  test  the  accuracy  of  the  results.  In  particular,  we  can  follow  the 
technique  suggested  by  Yurkin  and  coworkers  [11]  to  extrapolate  the  results  to  zero  cell  size. 

8.  SUMMARY 

The  major  changes  to  the  code,  after  decomposing  the  data  structures,  required  a  3D  transpose 
for  the  FFT  and  a  global  reduction  for  the  scalar  products  and  norms  in  the  iterative  solver.  Data 
decomposition  was  by  far  the  hardest  change  to  make  to  the  original  code.  Performance  of  the  code 
improved  when  we  removed  unnecessary  calls  to  the  FFT  routine  for  padded  zeros  required  for  the 
convolution.  For  big  targets,  the  new  code  exhibits  almost  perfect  strong  scaling  as  the  processor 
count  increases. 
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